ARDJ (Acceptability Rating Data for Japanese) is an ongoing research project that began in 2016. Its aim is to lay the foundations for Evidence-based Linguistics (EBL), echoing the perspective of Evidence-based Medicine (EBM), which implements the idea of a hierarchy of effective evidence and gives top priority to randomized controlled trials (RCTs).
In 2019, ARDJ yielded its first dataset, called “Survey 2 Unified”, based on a large-scale, web-based acquisition of four-point acceptability ratings for 468 Japanese sentences.
In 2020, ARDJ yielded another dataset, reported here, which we call “s1-s2 RT data”. It comprises a compilation of reaction-time (RT) data obtained from three groups of college students in Hakodate, Tokyo and Gifu. In this report, we describe its structure and present a few analyses, namely PCA with unsupervised clustering (X-means or FuzzyCMeans).
### Parameters for graphics
old.par <- par(no.readonly = TRUE) # save current settings; restore with par(old.par)
## core graphics
mfrow.2x3.val <- c(2,3)
mfrow.2x2.val <- c(2,2)
ja.fn <- "HiraKakuPro-W3"
rm.fn <- "Lucida Sans Unicode"
#par(family = eval(ja.fn), mar = c(5,5,5,5), xpd = T, cex = 0.6)
knitr::opts_chunk$set(eval = TRUE, echo = FALSE,
fig.height = 5, fig.width = 6)
## par() is not a chunk option, so it is called on its own
par(family = rm.fn, mar = c(4,5,4,5),
xpd = TRUE, cex = 0.6)
### color palette
n.cols <- 11 # This is the maximum
require(RColorBrewer)
## Loading required package: RColorBrewer
my.cols <- rev(brewer.pal(n.cols, "RdYlBu"))
## matplot
ylim.val <- c(0,9)
## violin plot
vio.ylim.val <- c(0,10)
vio.col.val <- "lightblue"
vio.median_col.val <- "magenta"
vio.box_col.val <- "blue"
vio.box_width.val <- 0.15
The sentences used as stimuli are sampled below.
## Loading required package: readxl
The RT data we obtained are sampled below.
## Warning: NAs introduced by coercion
## tibble [5,678 × 12] (S3: tbl_df/tbl/data.frame)
## $ rid : chr [1:5678] "h01" "h01" "h01" "h01" ...
## $ gr : chr [1:5678] "0" "0" "0" "0" ...
## $ sid : chr [1:5678] "s2-161" "s1-054" "s2-221" "s1-122" ...
## $ resp : num [1:5678] 1 2 1 2 1 1 1 2 1 1 ...
## $ RT : num [1:5678] 2.211 0.896 1.492 1.812 2.084 ...
## $ rt1 : num [1:5678] 1.05 0.183 0.85 1 1.217 ...
## $ rt2 : num [1:5678] 1 1.63 1.73 1 1.03 ...
## $ rt3 : num [1:5678] 1.15 1.17 1.27 0.95 1.03 ...
## $ rt4 : num [1:5678] 3.38 1.92 2.89 3.01 3.3 ...
## $ rt5 : num [1:5678] NA 0.896 NA NA NA ...
## $ sane : num [1:5678] 1 1 1 1 1 1 1 1 1 1 ...
## $ place: chr [1:5678] "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
Since the dataset contains rows generated in an incomplete setting, marked by sane = 0, we first exclude them from the analysis below.
## discard incomplete rows (sane == 0) to produce rt.du.raw
rt.du.raw <- subset(rt.dux.raw, sane == 1)
head(rt.du.raw)
str(rt.du.raw)
## tibble [4,902 × 12] (S3: tbl_df/tbl/data.frame)
## $ rid : chr [1:4902] "h01" "h01" "h01" "h01" ...
## $ gr : chr [1:4902] "0" "0" "0" "0" ...
## $ sid : chr [1:4902] "s2-161" "s1-054" "s2-221" "s1-122" ...
## $ resp : num [1:4902] 1 2 1 2 1 1 1 2 1 1 ...
## $ RT : num [1:4902] 2.211 0.896 1.492 1.812 2.084 ...
## $ rt1 : num [1:4902] 1.05 0.183 0.85 1 1.217 ...
## $ rt2 : num [1:4902] 1 1.63 1.73 1 1.03 ...
## $ rt3 : num [1:4902] 1.15 1.17 1.27 0.95 1.03 ...
## $ rt4 : num [1:4902] 3.38 1.92 2.89 3.01 3.3 ...
## $ rt5 : num [1:4902] NA 0.896 NA NA NA ...
## $ sane : num [1:4902] 1 1 1 1 1 1 1 1 1 1 ...
## $ place: chr [1:4902] "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
We remove outliers before the analysis proper.
Here is a description of the data before applying filtering.
## rt1 rt2 rt3 rt4 rt5 RT
## breaks Numeric,62 Numeric,66 Numeric,60 Numeric,35 Numeric,53 Numeric,43
## counts Integer,61 Integer,65 Integer,59 Integer,34 Integer,52 Integer,42
## density Numeric,61 Numeric,65 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## mids Numeric,61 Numeric,65 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## xname "d" "d" "d" "d" "d" "d"
## equidist TRUE TRUE TRUE TRUE TRUE TRUE
The data contain negative values for rt2. Negative reaction times are theoretically impossible, but they occurred, perhaps due to a misconfiguration of the experiments. Rows containing them are removed first.
## tibble [4,900 × 12] (S3: tbl_df/tbl/data.frame)
## $ rid : chr [1:4900] "h01" "h01" "h01" "h01" ...
## $ gr : chr [1:4900] "0" "0" "0" "0" ...
## $ sid : chr [1:4900] "s2-161" "s1-054" "s2-221" "s1-122" ...
## $ resp : num [1:4900] 1 2 1 2 1 1 1 2 1 1 ...
## $ RT : num [1:4900] 2.211 0.896 1.492 1.812 2.084 ...
## $ rt1 : num [1:4900] 1.05 0.183 0.85 1 1.217 ...
## $ rt2 : num [1:4900] 1 1.63 1.73 1 1.03 ...
## $ rt3 : num [1:4900] 1.15 1.17 1.27 0.95 1.03 ...
## $ rt4 : num [1:4900] 3.38 1.92 2.89 3.01 3.3 ...
## $ rt5 : num [1:4900] NA 0.896 NA NA NA ...
## $ sane : num [1:4900] 1 1 1 1 1 1 1 1 1 1 ...
## $ place: chr [1:4900] "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
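The removal of negative values can be sketched as follows. This is a minimal illustration on made-up numbers, not the code that produced the output above: the names `toy` and `drop_negative_rt` are hypothetical, and only the column names follow the dataset.

```r
# Toy data with some of the dataset's RT columns (values are made up).
toy <- data.frame(
  rid = c("h01", "h01", "h02", "h02"),
  rt1 = c(1.05, 0.18, 0.85, 1.00),
  rt2 = c(1.00, -0.40, 1.73, 1.00),  # one impossible negative value
  rt5 = c(NA, 0.90, NA, NA)
)

# Keep a row only if none of its non-missing RT values is negative.
# (Hypothetical helper; missing values such as rt5 = NA are ignored.)
drop_negative_rt <- function(d, cols) {
  has_negative <- apply(d[, cols, drop = FALSE], 1,
                        function(r) any(r < 0, na.rm = TRUE))
  d[!has_negative, , drop = FALSE]
}

toy.clean <- drop_negative_rt(toy, c("rt1", "rt2", "rt5"))
nrow(toy.clean)
```

Passing `na.rm = TRUE` matters here because most rows have no rt5 value; without it, the comparison would yield NA for those rows and the row filter would break.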
We decided to use standard deviation (SD) filtering after comparing it with filtering based on Mahalanobis distance.
In addition to excluding responses above an upper threshold (sd.ub), the following analysis also excludes outlier responses below a lower threshold (sd.lb); the latter exclusion was not included in the work presented at JCSS37.
## [1] "Set SD upper bound (sd.ub) to 3"
## [1] "Set SD lower bound (sd.lb) to 0.1"
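A rough sketch of this kind of filtering is given here. The report does not show how sd.ub and sd.lb enter the rule, so everything in the sketch is an assumption: the function name `sd_filter` is hypothetical, the upper cutoff is taken to be the mean plus sd.ub SDs, the lower cutoff is taken to be sd.lb times the SD, and outlying values are masked with NA rather than dropping whole rows.

```r
# Hypothetical SD filter: mask values outside the cutoffs with NA.
# Assumed rule (NOT confirmed by the report):
#   upper cutoff = mean + sd.ub * SD
#   lower cutoff = sd.lb * SD
sd_filter <- function(x, sd.ub = 3, sd.lb = 0.1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  too_high <- !is.na(x) & x > m + sd.ub * s
  too_low  <- !is.na(x) & x < sd.lb * s
  x[too_high | too_low] <- NA
  x
}

# Made-up reaction times: 20 ordinary values plus one very slow
# and one implausibly fast response.
rt <- c(rep(c(0.9, 1.0, 1.1, 1.2), 5), 30, 0.01)
rt.f <- sd_filter(rt)
sum(is.na(rt.f))  # number of masked values
```

Masking instead of deleting keeps the vector length unchanged, so a filter like this can be applied column by column without misaligning rows.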
Here is a description of the data after applying SD filtering.
## rt1 rt2 rt3 rt4 rt5 RT
## breaks Numeric,62 Numeric,49 Numeric,60 Numeric,35 Numeric,53 Numeric,43
## counts Integer,61 Integer,48 Integer,59 Integer,34 Integer,52 Integer,42
## density Numeric,61 Numeric,48 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## mids Numeric,61 Numeric,48 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## xname "d" "d" "d" "d" "d" "d"
## equidist TRUE TRUE TRUE TRUE TRUE TRUE
The data contain both rt5-free rows (rt5 is NA) and rt5-ful rows (rt5 has a value). For ease of analysis, we separate them into two datasets.
## 'data.frame': 268 obs. of 12 variables:
## $ rid : chr "h01" "h01" "h01" "h01" ...
## $ gr : chr "0" "0" "0" "0" ...
## $ sid : chr "s1-054" "s1-075" "s1-067" "s1-109" ...
## $ resp : num 2 1 2 2 1 1 1 2 2 2 ...
## $ RT : num 0.896 3.571 0.922 0.471 2.866 ...
## $ rt1 : num 0.183 0.867 0.584 0.483 0.217 ...
## $ rt2 : num 1.634 0.767 0.517 0.884 0.784 ...
## $ rt3 : num 1.17 0.8 0.6 1.2 0.55 ...
## $ rt4 : num 1.918 2.835 2.518 0.934 1 ...
## $ rt5 : num 0.896 3.571 0.922 0.471 2.866 ...
## $ sane : num 1 1 1 1 1 1 1 1 1 1 ...
## $ place: chr "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
## 'data.frame': 4632 obs. of 12 variables:
## $ rid : chr "h01" "h01" "h01" "h01" ...
## $ gr : chr "0" "0" "0" "0" ...
## $ sid : chr "s2-161" "s2-221" "s1-122" "s1-045" ...
## $ resp : num 1 1 2 1 1 1 2 1 1 2 ...
## $ RT : num 2.21 1.49 1.81 2.08 4.35 ...
## $ rt1 : num 1.05 0.85 1 1.22 1.08 ...
## $ rt2 : num 1 1.73 1 1.03 1.67 ...
## $ rt3 : num 1.15 1.27 0.95 1.03 1.93 ...
## $ rt4 : num 3.38 2.89 3.01 3.3 5.55 ...
## $ rt5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ sane : num 1 1 1 1 1 1 1 1 1 1 ...
## $ place: chr "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
## [1] "number of rt5-free sids: 446"
## [1] "number of rt5-ful sids: 32"
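The split by rt5-freeness amounts to testing whether rt5 is missing. A minimal sketch, using four rows whose sid, RT and rt5 values are taken from the listings above (the object names `toy`, `rt5.free` and `rt5.ful` are illustrative, not necessarily those of the original code):

```r
# Four rows copied from the listings above (other columns omitted).
toy <- data.frame(
  sid = c("s2-161", "s1-054", "s2-221", "s1-075"),
  RT  = c(2.211, 0.896, 1.492, 3.571),
  rt5 = c(NA, 0.896, NA, 3.571)
)

# rt5-free rows have no rt5 value; rt5-ful rows have one.
rt5.free <- toy[is.na(toy$rt5), ]
rt5.ful  <- toy[!is.na(toy$rt5), ]

# Count the distinct stimuli in each subset.
length(unique(rt5.free$sid))
length(unique(rt5.ful$sid))
```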
The following are plots of responses by participants, grouped by rt5-freeness.
## [1] "Sampled: TRUE; producing 14 plots"
## [1] "Sampled: TRUE; producing 14 plots"
The following are plots of unaggregated responses by stimulus, grouped by rt5-freeness. Unlike the aggregated version below, these plots show the responses of individual participants.
The following are plots of unaggregated responses to rt5-free stimuli.
## [1] "Sampled TRUE; producing 14 plots"
The following are plots of unaggregated responses to rt5-ful stimuli.
## [1] "Sampled: TRUE; producing 14 plots"
## [1] "Ignored s1-001 due to insufficient responses"
The following are plots of aggregated responses by stimulus, grouped by rt5-freeness. Responses were aggregated by taking, for each stimulus, the median of each of rt1, rt2, ..., rt5 and RT.
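Median aggregation per stimulus can be sketched with base R's `aggregate()`. The data are made up, only three of the RT columns are shown, and the object names (`toy`, `agg`, `agg.mat`) are illustrative rather than those of the original code.

```r
# Made-up responses: three for s1-001 and two for s1-002.
toy <- data.frame(
  sid = c("s1-001", "s1-001", "s1-001", "s1-002", "s1-002"),
  rt1 = c(0.30, 0.35, 0.90, 0.40, 0.44),
  rt2 = c(1.00, 1.20, 1.10, 0.80, NA),
  RT  = c(2.0, 2.4, 2.2, 1.5, 1.7)
)
cols <- c("rt1", "rt2", "RT")

# One row per sid; each cell is the median over that sid's responses.
agg <- aggregate(toy[, cols], by = list(sid = toy$sid),
                 FUN = median, na.rm = TRUE)

# Matrix form with sids as row names, convenient for clustering and PCA.
agg.mat <- as.matrix(agg[, cols])
rownames(agg.mat) <- agg$sid
agg.mat
```

The median is less sensitive than the mean to a single slow response, such as the 0.90 in rt1 for s1-001 in this toy example.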
The following are plots of aggregated responses by rt5-free stimuli.
## Loading required package: plotrix
## [1] "Sampled: TRUE; producing 14 plots"
The following are plots of aggregated responses by rt5-ful stimuli.
## [1] "Sampled: TRUE; producing 14 plots"
## [1] "Skipped s1-001 due to insufficient responses"
We apply unsupervised clustering and PCA to the raw responses, differentiated by rt5-freeness.
## Loading required package: clusternor
## Loading required package: Rcpp
## List of 7
## $ nrow : num 4632
## $ ncol : num 5
## $ iters : num 20
## $ k : num 4
## $ centers: num [1:4, 1:5] 1.656 0.635 0.529 0.56 0.898 ...
## $ cluster: int [1:4632] 4 4 4 4 3 4 4 3 4 3 ...
## $ size : int [1:4] 59 654 1358 2561
## Loading required package: MASS
## List of 7
## $ nrow : num 268
## $ ncol : num 6
## $ iters : num 20
## $ k : num 3
## $ centers: num [1:3, 1:6] 2.446 0.478 0.527 0.853 0.564 ...
## $ cluster: int [1:268] 2 3 2 2 3 1 1 2 2 3 ...
## $ size : int [1:3] 32 126 110
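The combination of clustering and PCA can be illustrated with base R alone. Note the substitution: the sketch uses `kmeans()` with a fixed number of clusters as a stand-in for X-means, which chooses the number of clusters itself, and it runs on synthetic data rather than on the RT dataset. All object names are illustrative.

```r
set.seed(1)

# Synthetic 5-column matrix standing in for rt1..rt4 and RT:
# two well-separated groups of 50 rows each.
x <- rbind(
  matrix(rnorm(50 * 5, mean = 0.5, sd = 0.1), ncol = 5),
  matrix(rnorm(50 * 5, mean = 2.0, sd = 0.1), ncol = 5)
)
colnames(x) <- c("rt1", "rt2", "rt3", "rt4", "RT")

# k-means with k fixed at 2 (stand-in for X-means).
km <- kmeans(x, centers = 2, nstart = 10)

# PCA on centered and scaled columns.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Plot the first two principal components, colored by cluster.
if (interactive()) {
  plot(pca$x[, 1:2], col = km$cluster, pch = 19,
       xlab = "PC1", ylab = "PC2")
}

table(km$cluster)
```

Coloring the PCA scores by cluster membership is what makes the two analyses informative together: the PCA gives the layout, and the clustering labels the points.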
Multivariate analysis of the raw responses is not revealing. What we want to know is the properties of the stimuli, which are distributed over the raw responses and are therefore latent at best. So we next apply unsupervised clustering and PCA to the aggregated responses, again differentiated by rt5-freeness.
## num [1:446, 1:5] 0.333 0.417 0.401 0.417 0.37 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:446] "s1-001" "s1-002" "s1-003" "s1-004" ...
## ..$ : chr [1:5] "rt1" "rt2" "rt3" "rt4" ...
We now cluster rt5-free aggregated responses. The result is the following.
## List of 7
## $ nrow : num 446
## $ ncol : num 5
## $ iters : num 20
## $ k : num 4
## $ centers: num [1:4, 1:5] 0.462 0.448 0.47 0.437 0.469 ...
## $ cluster: int [1:446] 1 4 4 3 2 4 1 3 4 2 ...
## $ size : int [1:4] 107 38 153 148
We now plot PCA of rt5-free aggregated responses.
It turned out that we should remove the few outlier sids listed below to get a better understanding of the data under scrutiny.
## [1] "Removing outlier sids:"
## [1] "s1-069" "s1-133" "s1-022" "s1-004"
## num [1:442, 1:5] 0.333 0.417 0.401 0.37 0.461 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "s1-001" "s1-002" "s1-003" "s1-005" ...
## ..$ : chr [1:5] "rt1" "rt2" "rt3" "rt4" ...
## List of 7
## $ nrow : num 442
## $ ncol : num 5
## $ iters : num 20
## $ k : num 4
## $ centers: num [1:4, 1:5] 0.469 0.437 0.462 0.448 0.461 ...
## $ cluster: int [1:442] 3 2 2 4 2 3 1 2 4 3 ...
## $ size : int [1:4] 149 148 107 38
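Dropping outlier sids from a matrix of aggregated responses is a matter of filtering on row names. A minimal sketch follows: the outlier list and the five rt1 values are taken from the output above, while the object names `agg` and `agg.clean` are illustrative.

```r
# First five rows of the rt1 column of the rt5-free aggregate.
agg <- matrix(c(0.333, 0.417, 0.401, 0.417, 0.370), ncol = 1,
              dimnames = list(
                c("s1-001", "s1-002", "s1-003", "s1-004", "s1-005"),
                "rt1"))

outlier.sids <- c("s1-069", "s1-133", "s1-022", "s1-004")

# drop = FALSE keeps the result a matrix even with a single column.
agg.clean <- agg[!rownames(agg) %in% outlier.sids, , drop = FALSE]
rownames(agg.clean)
```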
We now plot PCA of rt5-free aggregated responses without outliers.
## num [1:32, 1:6] 0.217 0.339 0.871 3.953 0.378 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:32] "s1-001" "s1-009" "s1-018" "s1-019" ...
## ..$ : chr [1:6] "rt1" "rt2" "rt3" "rt4" ...
We now cluster rt5-ful aggregated responses using FuzzyCMeans because X-means does not work on them.
## List of 7
## $ nrow : num 32
## $ ncol : num 6
## $ iters : num 87
## $ k : num 6
## $ centers: num [1:6, 1:6] 0.217 0.389 3.94 0.623 0.43 ...
## $ cluster: int [1:32] 1 2 2 3 5 4 5 6 5 5 ...
## $ size : int [1:6] 1 4 1 4 14 8
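For readers unfamiliar with fuzzy c-means, the idea can be sketched in base R. This is a textbook-style illustration on synthetic data, not the clusternor implementation used for the result above; the function name `fuzzy_cmeans` and its arguments are hypothetical.

```r
# Minimal fuzzy c-means (illustrative only).
# x: numeric matrix, k: number of clusters, m: fuzzifier (> 1).
fuzzy_cmeans <- function(x, k, m = 2, max_iter = 100, tol = 1e-6) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Random initial memberships; each row sums to 1.
  u <- matrix(runif(n * k), nrow = n, ncol = k)
  u <- u / rowSums(u)
  for (iter in seq_len(max_iter)) {
    w <- u^m
    # Cluster centers: membership-weighted means (k x p matrix).
    centers <- (t(w) %*% x) / colSums(w)
    # Squared Euclidean distance from every row to every center (n x k).
    d2 <- sapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)  # avoid division by zero
    # Membership update: u_ij proportional to d_ij^(-2 / (m - 1)).
    inv <- d2^(-1 / (m - 1))
    u_new <- inv / rowSums(inv)
    converged <- max(abs(u_new - u)) < tol
    u <- u_new
    if (converged) break
  }
  list(centers = centers, membership = u,
       cluster = max.col(u, ties.method = "first"), iters = iter)
}

# Synthetic data: two tight groups of 20 rows each.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0.5, sd = 0.05), ncol = 2),
           matrix(rnorm(40, mean = 3.0, sd = 0.05), ncol = 2))
fit <- fuzzy_cmeans(x, k = 2)
table(fit$cluster)
```

Unlike k-means, each row receives a degree of membership in every cluster; the hard `cluster` labels are obtained only at the end by taking the cluster with the largest membership.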
We now plot the PCA of rt5-ful aggregated responses.
We end this report with clusterwise plots of aggregated responses.
We first plot rt5-free aggregated responses clusterwise.
## [1] "Sampled: TRUE; procuding 14 plots for cluster 1"
## [1] "Sampled: TRUE; procuding 14 plots for cluster 2"
## [1] "Sampled: TRUE; procuding 14 plots for cluster 3"
## [1] "Sampled: TRUE; procuding 14 plots for cluster 4"
We then plot rt5-ful aggregated responses clusterwise.
## [1] "Skipped due to insufficinet number of responses"
The responses used for the current analysis are far from representative: they are too few and not varied enough. To make our results more robust and reliable, it is essential to reach a larger pool of participants, ideally with varied backgrounds. Frankly, doing that requires more funding.